Agenda

  • Questions arising from lessons, tasks
  • Examples from lessons 15-17: linear models and smooths

The plan for the live coding is for us to work together to solve problems using tools from this week’s lessons. As with last week, we’ll start with some repeated examples from lessons to remind ourselves of the basics, then try to use the tools in new ways, and finally switch perspectives to try to solve a real problem with each tool.

In this live coding session I will focus mostly on adding lines and smooths to a scatterplot. We can look at lm, gam, loess, augment, tidy and glance functions for working with models outside of a visualization too.

Linear models: linear regression

Let’s try to work step-by-step through an old example from Healy’s book: life expectancy as a function of year, grouped by continents.

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) 

Put each continent on its own facet. Add a smooth for each continent. Show the country data as lines. Make the country lines fade into the background.

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent)  +
  geom_smooth(method = "lm", formula = y ~ x, size = 2) +
  theme_bw()

Linear models: other regression lines

Polynomials.

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent)  +
  geom_smooth(method = "lm", formula = y ~ poly(x,2), size = 2) +
  theme_bw()

Log transforms. Other transformations?

What is the relationship between penguin body mass and flipper length? Is this the same for each species? For male and female penguins?

What if body mass is not a linear feature of a length?

\[ M = k L^b\] Take the log of this equation

\[ \log M = \log k + b \log L\]

penguins %>% ggplot(aes(y = body_mass_g, x = flipper_length_mm, color = species)) +
  geom_point() +
  scale_x_log10() + scale_y_log10() +
  geom_smooth(method = "lm", formula = y ~ x, show.legend = FALSE) +
  theme_bw()

Linear models: quantile regression

Find the quantile regression lines for life expectancy within Europe over time.

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent)  +
  geom_quantile(method = "rq", formula = y ~ x, size = 2, color = "blue",
                quantiles = c(0.10, 0.90)) +
  theme_bw()

quantreg::rq(lifeExp ~ year1950, 
             data = gapminder %>% filter(continent == "Europe") %>% mutate(year1950 = year- 1950),
             tau = c(0.10, 0.50, 0.90))
## Call:
## quantreg::rq(formula = lifeExp ~ year1950, tau = c(0.1, 0.5, 
##     0.9), data = gapminder %>% filter(continent == "Europe") %>% 
##     mutate(year1950 = year - 1950))
## 
## Coefficients:
##               tau= 0.1   tau= 0.5 tau= 0.9
## (Intercept) 59.8917556 66.3386000   70.789
## year1950     0.2561778  0.2173429    0.173
## 
## Degrees of freedom: 360 total; 358 residual

Linear regression for penguin, log transformed data.

lm(body_mass_g ~ flipper_length_mm, data = penguins %>% filter(species == "Gentoo"))  
## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins %>% 
##     filter(species == "Gentoo"))
## 
## Coefficients:
##       (Intercept)  flipper_length_mm  
##          -6787.28              54.62
m1 <- penguins %>% filter(species == "Gentoo") %>%
  mutate(log_body_mass = log10(body_mass_g),
         log_flipper_length = log10(flipper_length_mm)) %>%
  lm(log_body_mass ~ log_flipper_length, data = .)  
tidy(m1, conf.int = TRUE)
## # A tibble: 2 x 7
##   term               estimate std.error statistic  p.value conf.low conf.high
##   <chr>                 <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)           -1.82     0.508     -3.57 5.07e- 4    -2.82    -0.810
## 2 log_flipper_length     2.36     0.217     10.9  1.33e-19     1.93     2.79

GAM smooths

Is a GAM or linear model better (judged by the visualization) for the relationship between life expectancy and year? life expectancy and GDP per capita?

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

LOESS smooths

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent) +
  geom_smooth(method = "loess", formula = y ~ x^2)

gapminder %>% ggplot(aes(x = year, y = lifeExp)) + 
  geom_line(aes(group = country), alpha = 0.5) +
  facet_grid( ~ continent) +
  geom_smooth(method = "loess")
## `geom_smooth()` using formula 'y ~ x'

penguins %>% ggplot(aes(y = body_mass_g, x = flipper_length_mm, color = species)) +
  geom_point() +
  scale_x_log10() + scale_y_log10() +
  geom_smooth(method = "loess", formula = y ~ x, show.legend = FALSE) +
  theme_bw()

m2 <- penguins %>% filter(species == "Gentoo") %>%
  mutate(log_body_mass = log10(body_mass_g),
         log_flipper_length = log10(flipper_length_mm)) %>%
  loess(log_body_mass ~ log_flipper_length, data = .)  
m2
## Call:
## loess(formula = log_body_mass ~ log_flipper_length, data = .)
## 
## Number of Observations: 123 
## Equivalent Number of Parameters: 4.89 
## Residual Standard Error: 0.03095
# tidy(m1, conf.int = TRUE) # doesn't work

Getting model parameters

Use lm, loess together with tidy, glance, and augment to get model parameters, residuals, and fitted values.

Use augment to get predictions and residuals.

augment(m2) %>% ggplot(aes(x = .resid)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

augment(m2) %>% ggplot(aes(x = log_body_mass, y = .fitted)) + geom_point() +
  geom_abline(aes(intercept = 0, slope = 1))

Confidence intervals on predictions. This works for lm models, but is a bit different for loess and gam; see the course notes for more.

augment(m1, interval = "confidence") # about the mean over all birds
## # A tibble: 123 x 11
##    .rownames log_body_mass log_flipper_len… .fitted .lower .upper   .resid
##    <chr>             <dbl>            <dbl>   <dbl>  <dbl>  <dbl>    <dbl>
##  1 1                  3.65             2.32    3.67   3.67   3.68 -0.0210 
##  2 2                  3.76             2.36    3.76   3.75   3.77 -0.00676
##  3 3                  3.65             2.32    3.67   3.66   3.68 -0.0210 
##  4 4                  3.76             2.34    3.71   3.70   3.71  0.0482 
##  5 5                  3.73             2.33    3.69   3.69   3.70  0.0389 
##  6 6                  3.66             2.32    3.67   3.66   3.68 -0.0113 
##  7 7                  3.68             2.32    3.67   3.67   3.68  0.00705
##  8 8                  3.72             2.34    3.71   3.71   3.72  0.00364
##  9 9                  3.64             2.32    3.66   3.66   3.67 -0.0210 
## 10 10                 3.71             2.33    3.69   3.69   3.70  0.0184 
## # … with 113 more rows, and 4 more variables: .std.resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>
augment(m1, interval = "prediction") # about the prediction of a single bird.
## # A tibble: 123 x 11
##    .rownames log_body_mass log_flipper_len… .fitted .lower .upper   .resid
##    <chr>             <dbl>            <dbl>   <dbl>  <dbl>  <dbl>    <dbl>
##  1 1                  3.65             2.32    3.67   3.61   3.74 -0.0210 
##  2 2                  3.76             2.36    3.76   3.70   3.83 -0.00676
##  3 3                  3.65             2.32    3.67   3.61   3.73 -0.0210 
##  4 4                  3.76             2.34    3.71   3.65   3.77  0.0482 
##  5 5                  3.73             2.33    3.69   3.63   3.76  0.0389 
##  6 6                  3.66             2.32    3.67   3.61   3.73 -0.0113 
##  7 7                  3.68             2.32    3.67   3.61   3.74  0.00705
##  8 8                  3.72             2.34    3.71   3.65   3.77  0.00364
##  9 9                  3.64             2.32    3.66   3.60   3.73 -0.0210 
## 10 10                 3.71             2.33    3.69   3.63   3.76  0.0184 
## # … with 113 more rows, and 4 more variables: .std.resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>